AITopics

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Neural Information Processing SystemsOct-2-2025, 14:22:29 GMT

More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation

Quanfu Fan, Chun-Fu (Richard) Chen, Hilde Kuehne, Marco Pistoia, David Cox

In general, Longer input sequences yield better recognition results.

architecture, artificial intelligence, machine learning, (14 more...)

Country: North America (0.28)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Neural Information Processing SystemsOct-2-2025, 14:22:15 GMT

The key contribution of our work is the development of an efficient and memory-2

We thank all the reviewers for their constructive comments. Please see our responses below. Our approach is purely based on 2D convolutions. We thank the reviewers for pointing out some related (or missing) references. Timeception, SlowFast and TSM are concurrent with our work.

artificial intelligence, convolution, key contribution, (16 more...)

Technology: Information Technology > Artificial Intelligence (0.77)

Neural Information Processing SystemsAug-15-2025, 10:54:00 GMT

Appendix: Convolutional Tensor-Train LSTM for Spatio-temporal Learning

Our experimental setup follows Wang et al.

dataset, prediction, window approach, (12 more...)

Country:

North America > United States > Maryland > Prince George's County > College Park (0.14)
North America > United States > California > Santa Clara County > Santa Clara (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

He, Yufei, Zhang, Xucong, Stienen, Arno H. A.

Gaze-Guided 3D Hand Motion Prediction for Detecting Intent in Egocentric Grasping Tasks

arXiv.org Artificial IntelligenceMar-27-2025

Human intention detection with hand motion prediction is critical to drive the upper-extremity assistive robots in neurorehabilitation applications. However, the traditional methods relying on physiological signal measurement are restrictive and often lack environmental context. We propose a novel approach that predicts future sequences of both hand poses and joint positions. This method integrates gaze information, historical hand motion sequences, and environmental object data, adapting dynamically to the assistive needs of the patient without prior knowledge of the intended object for grasping. Specifically, we use a vector-quantized variational autoencoder for robust hand pose encoding with an autoregressive generative transformer for effective hand motion sequence prediction. We demonstrate the usability of these novel techniques in a pilot study with healthy subjects. To train and evaluate the proposed method, we collect a dataset consisting of various types of grasp actions on different objects from multiple subjects. Through extensive experiments, we demonstrate that the proposed method can successfully predict sequential hand movement. Especially, the gaze information shows significant enhancements in prediction capabilities, particularly with fewer input frames, highlighting the potential of the proposed method for real-world applications.

artificial intelligence, machine learning, prediction, (16 more...)

2504.01024

Country: Europe > Netherlands > South Holland > Delft (0.04)

Genre: Research Report > Promising Solution (0.54)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.93)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Schmitt, Robin, Zeyer, Albert, Zeineldeen, Mohammad, Schlüter, Ralf, Ney, Hermann

The Conformer Encoder May Reverse the Time Dimension

arXiv.org Machine LearningOct-1-2024

We sometimes observe monotonically decreasing cross-attention weights in our Conformer-based global attention-based encoder-decoder (AED) models. Further investigation shows that the Conformer encoder internally reverses the sequence in the time dimension. We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to build a connection between the initial frames and all other informative frames. Furthermore, we show that, at some point in training, the self-attention module of the Conformer starts dominating the output over the preceding feed-forward module, which then only allows the reversed information to pass through. We propose several methods and ideas of how this flipping can be avoided. Additionally, we investigate a novel method to obtain label-frame-position alignments by using the gradients of the label log probabilities w.r.t. the encoder input frames.

computational linguistic, sequence, speech recognition, (14 more...)

arXiv.org Machine Learning

2410.0068

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > Germany > North Rhine-Westphalia > Cologne Region > Aachen (0.04)
Oceania > Australia (0.04)
(10 more...)

Genre: Research Report (0.85)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Kirsten, Lucas Nedel, Fu, Zhicheng, Madhusudhana, Nikhil Ambha

MobileMEF: Fast and Efficient Method for Multi-Exposure Fusion

arXiv.org Artificial IntelligenceAug-15-2024

Recent advances in camera design and imaging technology have enabled the capture of high-quality images using smartphones. However, due to the limited dynamic range of digital cameras, the quality of photographs captured in environments with highly imbalanced lighting often results in poor-quality images. To address this issue, most devices capture multi-exposure frames and then use some multi-exposure fusion method to merge those frames into a final fused image. Nevertheless, most traditional and current deep learning approaches are unsuitable for real-time applications on mobile devices due to their heavy computational and memory requirements. We propose a new method for multi-exposure fusion based on an encoder-decoder deep learning architecture with efficient building blocks tailored for mobile devices. This efficient design makes our model capable of processing 4K resolution images in less than 2 seconds on mid-range smartphones. Our method outperforms state-of-the-art techniques regarding full-reference quality measures and computational efficiency (runtime and memory usage), making it ideal for real-time applications on hardware-constrained devices. Our code is available at: https://github.com/LucasKirsten/MobileMEF.

architecture, input frame, module, (16 more...)

2408.07932

Country:

Oceania > Australia (0.04)
Europe > United Kingdom (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Asia > Japan (0.04)

Genre: Research Report > New Finding (0.68)

Industry:

Semiconductors & Electronics (0.68)
Health & Medicine (0.48)
Media > Photography (0.34)

Technology:

Information Technology > Communications > Mobile (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceMay-21-2024

The Power of Next-Frame Prediction for Learning Physical Laws

Winterbottom, Thomas, Hudson, G. Thomas, Kluvanec, Daniel, Slack, Dean, Sterling, Jamie, Shentu, Junjie, Xiao, Chenghao, Zhou, Zheming, Moubayed, Noura Al

Next-frame prediction is a useful and powerful method for modelling and understanding the dynamics of video data. Inspired by the empirical success of causal language modelling and next-token prediction in language modelling, we explore the extent to which next-frame prediction serves as a strong foundational learning strategy (analogous to language modelling) for inducing an understanding of the visual world. In order to quantify the specific visual understanding induced by next-frame prediction, we introduce six diagnostic simulation video datasets derived from fundamental physical laws created by varying physical constants such as gravity and mass. We demonstrate that our models trained only on next-frame prediction are capable of predicting the value of these physical constants (e.g. gravity) without having been trained directly to learn these constants via a regression task. We find that the generative training phase alone induces a model state that can predict physical constants significantly better than that of a random model, improving the loss by a factor of between 1.28 to 6.24. We conclude that next-frame prediction shows great promise as a general learning strategy to induce understanding of the many `laws' that govern the visual domain without the need for explicit labelling.

dataset, next-frame prediction, prediction, (13 more...)

2405.1745

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.94)

Sahel, Yair Ben, Eldar, Yonina C.

Self-STORM: Deep Unrolled Self-Supervised Learning for Super-Resolution Microscopy

arXiv.org Artificial IntelligenceMar-25-2024

The use of fluorescent molecules to create long sequences of low-density, diffraction-limited images enables highly-precise molecule localization. However, this methodology requires lengthy imaging times, which limits the ability to view dynamic interactions of live cells on short time scales. Many techniques have been developed to reduce the number of frames needed for localization, from classic iterative optimization to deep neural networks. Particularly, deep algorithm unrolling utilizes both the structure of iterative sparse recovery algorithms and the performance gains of supervised deep learning. However, the robustness of this approach is highly dependant on having sufficient training data. In this paper we introduce deep unrolled self-supervised learning, which alleviates the need for such data by training a sequence-specific, model-based autoencoder that learns only from given measurements. Our proposed method exceeds the performance of its supervised counterparts, thus allowing for robust, dynamic imaging well below the diffraction limit without any labeled training samples. Furthermore, the suggested model-based autoencoder scheme can be utilized to enhance generalization in any sparse recovery framework, without the need for external training data.

dataset, reconstruction, self-storm, (15 more...)

2403.16974

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States (0.14)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)

Zhang, Xiaotong, Chong, Jinger, Youcef-Toumi, Kamal

How Does Perception Affect Safety: New Metrics and Strategy

arXiv.org Artificial IntelligenceDec-12-2023

Perception serves as a critical component in the functionality of autonomous agents. However, the intricate relationship between perception metrics and robotic metrics remains unclear, leading to ambiguity in the development and fine-tuning of perception algorithms. In this paper, we introduce a methodology for quantifying this relationship, taking into account factors such as detection rate, detection quality, and latency. Furthermore, we introduce two novel metrics for Human-Robot Collaboration safety predicated upon perception metrics: Critical Collision Probability (CCP) and Average Collision Probability (ACP). To validate the utility of these metrics in facilitating algorithm development and tuning, we develop an attentive processing strategy that focuses exclusively on key input features. This approach significantly reduces computational time while preserving a similar level of accuracy. Experimental results indicate that the implementation of this strategy in an object detector leads to a maximum reduction of 30.091% in inference time and 26.534% in total time per frame. Additionally, the strategy lowers the CCP and ACP in a baseline model by 11.252% and 13.501%, respectively. The source code will be made publicly available in the final proof version of the manuscript.

algorithm, neural network, probability, (15 more...)

2312.07744

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Europe > Italy (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.30)